
    Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs

    Large language models (LLMs) are at the forefront of transforming numerous domains globally. However, their inclusivity and effectiveness remain limited for non-Latin scripts and low-resource languages. This paper tackles the challenge of enhancing the multilingual performance of LLMs, focusing specifically on generative models. Through systematic investigation and evaluation of diverse languages using popular question-answering (QA) datasets, we present novel techniques that unlock the potential of LLMs in a polyglot landscape. Our approach encompasses three key strategies that yield substantial improvements in multilingual proficiency. First, by optimizing prompts tailored for polyglot LLMs, we unlock their latent capabilities, resulting in substantial performance gains across languages. Second, we introduce a hybrid approach that combines GPT generation with multilingual embeddings and achieves significant multilingual performance improvements on critical tasks such as QA and retrieval. Finally, to further improve the performance of polyglot LLMs, we introduce a novel learning algorithm that dynamically selects the optimal prompt strategy, LLM, and embeddings for each query. This dynamic adaptation maximizes the efficacy of LLMs across languages, outperforming the best static and random strategies. Our results show substantial advances in multilingual understanding and generation across a diverse range of languages.
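The per-query dynamic selection described above can be sketched as a simple multi-armed bandit over (prompt strategy, model, embedding) combinations. The arms, the epsilon-greedy rule, and the reward signal below are illustrative assumptions, not the paper's actual algorithm:

```python
import random

# Illustrative epsilon-greedy selector over (prompt strategy, model,
# embedding) combinations. Each arm's mean observed reward is tracked;
# with probability epsilon a random arm is explored, otherwise the
# best-performing arm so far is exploited.
ARMS = [
    ("native-prompt", "gpt-4", "multilingual-emb"),
    ("translate-prompt", "gpt-4", "multilingual-emb"),
    ("english-prompt", "gpt-3.5", "monolingual-emb"),
]

class EpsilonGreedySelector:
    def __init__(self, arms, epsilon=0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.total_reward = {a: 0.0 for a in self.arms}

    def _mean(self, arm):
        # Mean reward of an arm; 0.0 if the arm was never tried.
        return self.total_reward[arm] / self.counts[arm] if self.counts[arm] else 0.0

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)    # explore
        return max(self.arms, key=self._mean)  # exploit

    def update(self, arm, reward):
        # Record the observed quality (e.g. per-query QA accuracy).
        self.counts[arm] += 1
        self.total_reward[arm] += reward
```

With epsilon set to 0 the selector is purely greedy; in practice a small exploration rate lets it keep adapting as per-language reward estimates change.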

    MEGA: Multilingual Evaluation of Generative AI

    Generative AI models have shown impressive performance on many natural language processing (NLP) tasks such as language understanding, reasoning, and language generation. An important question the AI community is asking today concerns the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies of generative LLMs have been restricted to English, and it is unclear how capable these models are at understanding and generating text in other languages. We present MEGA, the first comprehensive benchmarking of generative LLMs, which evaluates models on standard NLP benchmarks, covering 16 NLP datasets across 70 typologically diverse languages. We compare the performance of generative LLMs, including ChatGPT and GPT-4, to state-of-the-art (SOTA) non-autoregressive models on these tasks, to determine how well generative models perform compared to the previous generation of LLMs. We present a thorough analysis of model performance across languages and tasks and discuss challenges in improving the performance of generative LLMs on low-resource languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field. Comment: EMNLP 202

    Non-invasive diagnostic tests for Helicobacter pylori infection

    BACKGROUND: Helicobacter pylori (H pylori) infection has been implicated in a number of malignancies and non-malignant conditions including peptic ulcers, non-ulcer dyspepsia, recurrent peptic ulcer bleeding, unexplained iron deficiency anaemia, idiopathic thrombocytopaenic purpura, and colorectal adenomas. The confirmatory diagnosis of H pylori is by endoscopic biopsy, followed by histopathological examination using haematoxylin and eosin (H & E) stain or special stains such as Giemsa stain and Warthin-Starry stain. Special stains are more accurate than H & E stain. There is significant uncertainty about the diagnostic accuracy of non-invasive tests for diagnosis of H pylori. OBJECTIVES: To compare the diagnostic accuracy of urea breath test, serology, and stool antigen test, used alone or in combination, for diagnosis of H pylori infection in symptomatic and asymptomatic people, so that eradication therapy for H pylori can be started. SEARCH METHODS: We searched MEDLINE, Embase, the Science Citation Index and the National Institute for Health Research Health Technology Assessment Database on 4 March 2016. We screened references in the included studies to identify additional studies. We also conducted citation searches of relevant studies, most recently on 4 December 2016. We did not restrict studies by language or publication status, or whether data were collected prospectively or retrospectively. SELECTION CRITERIA: We included diagnostic accuracy studies that evaluated at least one of the index tests (urea breath test using isotopes such as 13C or 14C, serology and stool antigen test) against the reference standard (histopathological examination using H & E stain, special stains or immunohistochemical stain) in people suspected of having H pylori infection. DATA COLLECTION AND ANALYSIS: Two review authors independently screened the references to identify relevant studies and independently extracted data.
We assessed the methodological quality of studies using the QUADAS-2 tool. We performed meta-analysis using the hierarchical summary receiver operating characteristic (HSROC) model to estimate and compare SROC curves. Where appropriate, we used bivariate or univariate logistic regression models to estimate summary sensitivities and specificities. MAIN RESULTS: We included 101 studies involving 11,003 participants, of which 5839 participants (53.1%) had H pylori infection. The prevalence of H pylori infection in the studies ranged from 15.2% to 94.7%, with a median prevalence of 53.7% (interquartile range 42.0% to 66.5%). Most of the studies (57%) included participants with dyspepsia, and 53 studies excluded participants who had recently had proton pump inhibitors or antibiotics. There was at least an unclear risk of bias or unclear applicability concern for each study. Of the 101 studies, 15 compared the accuracy of two index tests and two studies compared the accuracy of three index tests. Thirty-four studies (4242 participants) evaluated serology; 29 studies (2988 participants) evaluated stool antigen test; 34 studies (3139 participants) evaluated urea breath test-13C; 21 studies (1810 participants) evaluated urea breath test-14C; and two studies (127 participants) evaluated urea breath test but did not report the isotope used. The thresholds used to define test positivity and the staining techniques used for histopathological examination (reference standard) varied between studies. Due to sparse data for each threshold reported, it was not possible to identify the best threshold for each test. Using data from 99 studies in an indirect test comparison, there was statistical evidence of a difference in diagnostic accuracy between urea breath test-13C, urea breath test-14C, serology and stool antigen test (P = 0.024).
The diagnostic odds ratios for urea breath test-13C, urea breath test-14C, serology, and stool antigen test were 153 (95% confidence interval (CI) 73.7 to 316), 105 (95% CI 74.0 to 150), 47.4 (95% CI 25.5 to 88.1) and 45.1 (95% CI 24.2 to 84.1). The sensitivity (95% CI), estimated at a fixed specificity of 0.90 (median from studies across the four tests), was 0.94 (95% CI 0.89 to 0.97) for urea breath test-13C, 0.92 (95% CI 0.89 to 0.94) for urea breath test-14C, 0.84 (95% CI 0.74 to 0.91) for serology, and 0.83 (95% CI 0.73 to 0.90) for stool antigen test. This implies that on average, given a specificity of 0.90 and prevalence of 53.7% (median specificity and prevalence in the studies), out of 1000 people tested for H pylori infection, there will be 46 false positives (people without H pylori infection who will be diagnosed as having H pylori infection). In this hypothetical cohort, urea breath test-13C, urea breath test-14C, serology, and stool antigen test will give 30 (95% CI 15 to 58), 42 (95% CI 30 to 58), 86 (95% CI 50 to 140), and 89 (95% CI 52 to 146) false negatives respectively (people with H pylori infection for whom the diagnosis of H pylori will be missed). Direct comparisons were based on few head-to-head studies. The ratios of diagnostic odds ratios (DORs) were 0.68 (95% CI 0.12 to 3.70; P = 0.56) for urea breath test-13C versus serology (seven studies), and 0.88 (95% CI 0.14 to 5.56; P = 0.84) for urea breath test-13C versus stool antigen test (seven studies). The 95% CIs of these estimates overlap with those of the ratios of DORs from the indirect comparison. Data were limited or unavailable for meta-analysis of other direct comparisons.
AUTHORS' CONCLUSIONS: In people without a history of gastrectomy and those who have not recently had antibiotics or proton pump inhibitors, urea breath tests had high diagnostic accuracy, while serology and stool antigen tests were less accurate for diagnosis of Helicobacter pylori infection. This is based on an indirect test comparison (with potential for bias due to confounding), as evidence from direct comparisons was limited or unavailable. The thresholds used for these tests were highly variable and we were unable to identify specific thresholds that might be useful in clinical practice. We need further comparative studies of high methodological quality to obtain more reliable evidence of relative accuracy between the tests. Such studies should be conducted prospectively in a representative spectrum of participants and clearly reported to ensure low risk of bias. Most importantly, studies should prespecify and clearly report thresholds used, and should avoid inappropriate exclusions.
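The hypothetical-cohort arithmetic in the results above follows directly from the summary sensitivities, the fixed specificity, and the prevalence. A minimal sketch, using the figures quoted in the review:

```python
# The review's hypothetical cohort: 1000 people tested, prevalence
# 53.7%, and every test evaluated at a fixed specificity of 0.90.
n = 1000
prevalence = 0.537
specificity = 0.90

n_infected = round(n * prevalence)   # 537 people with H pylori
n_uninfected = n - n_infected        # 463 people without

# False positives depend only on specificity and the uninfected count.
false_positives = round((1 - specificity) * n_uninfected)   # 46

# False negatives depend on each test's summary sensitivity.
sensitivities = {
    "urea breath test-13C": 0.94,
    "urea breath test-14C": 0.92,
    "serology": 0.84,
    "stool antigen test": 0.83,
}
false_negatives = {name: round((1 - s) * n_infected)
                   for name, s in sensitivities.items()}
```

Small discrepancies from the published false-negative counts (e.g. 30 quoted for urea breath test-13C versus 32 from this arithmetic) arise because the review's summary sensitivities are reported to only two decimal places.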

    On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

    Borrowing ideas from production functions in microeconomics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case study on the TyDiQA-GoldP dataset. One interesting conclusion of the study is that, if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with at least some, or only, manually-created data. To our knowledge, this is the first attempt to extend the concept of production functions to the study of data collection strategies for training multilingual models, and it can serve as a valuable tool for analysing similar cost-versus-data trade-offs in NLP. Comment: NAACL 202
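A toy instance of such a production-function framework can make the trade-off concrete. The functional form, the discount on translated data, and the unit costs below are illustrative assumptions, not the paper's fitted model:

```python
# Toy production function: performance grows with diminishing returns
# in effective data, where each machine-translated example counts as
# a discounted fraction of a manually-created one.
def performance(m, t, discount=0.7, alpha=0.5):
    # m manual examples, t machine-translated examples.
    return (m + discount * t) ** alpha

def cost(m, t, cost_manual=1.0, cost_mt=0.1):
    # Made-up per-example annotation vs. translation costs.
    return cost_manual * m + cost_mt * t

def cheapest_mix(target, max_examples=2000, step=50):
    # Grid-search the least-cost (manual, translated) mix that
    # reaches a target performance level; returns (cost, m, t).
    best = None
    for m in range(0, max_examples + 1, step):
        for t in range(0, max_examples + 1, step):
            if performance(m, t) >= target:
                c = cost(m, t)
                if best is None or c < best[0]:
                    best = (c, m, t)
    return best
```

Under these made-up numbers pure machine translation happens to be cheapest; with other cost ratios or functional forms the optimum shifts toward manual data, which is exactly the kind of trade-off the framework is designed to expose.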

    Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages

    Although recent massively multilingual language models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages, with little linguistic diversity. We argue that this makes existing practices in multilingual evaluation unreliable and fails to provide a full picture of MMLM performance across the linguistic landscape. We propose that recent work on performance prediction for NLP tasks can help fix benchmarking in multilingual NLP, by using features related to data and language typology to estimate the performance of an MMLM on different languages. We compare performance prediction with translating test data in a case study on four multilingual datasets, and observe that performance prediction can provide reliable estimates that are often on par with translation-based approaches, without incurring any additional translation or evaluation costs. Comment: NLP Power! Workshop, ACL 202
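Performance prediction of this kind can be sketched as a regression from per-language features to observed scores. The feature choices and the accuracy values below are synthetic placeholders, not drawn from any of the datasets discussed:

```python
import numpy as np

# Fit a linear model from per-language features (here: log pretraining
# data size, typological distance from English) to observed task
# accuracy, then estimate accuracy for a language with no test set.
X_train = np.array([
    [9.0, 0.0],   # high-resource, English-like
    [8.0, 0.2],
    [6.5, 0.5],
    [5.0, 0.7],   # low-resource, typologically distant
])
y_train = np.array([0.85, 0.80, 0.62, 0.45])  # synthetic accuracies

# Least-squares fit with a bias column appended to the features.
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(features):
    # Predicted accuracy for a language given its feature vector.
    return float(np.append(np.asarray(features, dtype=float), 1.0) @ coef)

estimate = predict([6.0, 0.6])  # a language with no evaluation data
```

Real systems use richer features and stronger regressors, but the shape is the same: train on the languages that do have test sets, then read off estimates for those that do not.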